A classification of companies into sectors of the economy is important for macroeconomic analysis
and for investments into the sector-specific financial indices and exchange traded funds (ETFs).
Major industrial classification systems and financial indices have historically been based on expert
opinion and developed manually. Here we show how unsupervised machine learning can provide a
more objective and comprehensive broad-level sector decomposition of stocks. An emergent
low-dimensional structure in the space of historical stock price returns automatically identifies
“canonical sectors” in the market, and assigns every stock a participation weight into these sectors.
Furthermore, by analyzing data from different periods, we show how these weights for listed firms
have evolved over time.


Stock market performance is measured with aggregated
quantities called indices that represent a weighted
average price of a basket of stocks. Market-wide indices
such as the Russell 3000 [1] and the S&P 500 [2] consist of
stocks from diverse companies reflecting a broad
cross-section of the market. Sector-specific indices such as
the Dow Jones Financials Index [3], the CBOE Oil Index
[4] and the Morgan Stanley High-Tech 35 Index [5]
are more granular, and their composition requires a
classification of companies into sectors. Major industrial
classification schemes classify firms into sectors, albeit
with many ambiguities [6]. It is not clear, for example,
how to assign a sector to conglomerates or diversified
companies such as General Electric. Conversely,
non-conglomerates with exposure to firms outside their
own sector (for example, an investment bank exclusively
serving pharmaceutical firms) also blur the boundaries
of sector identification. Moreover, as the economic
environment or the companies evolve, neither the industrial
sectors nor the firms’ sector associations remain static,
necessitating updates to sector assignments and the addition
of new sectors.

A significant number of studies have previously aimed
at finding categories of stocks in financial markets with a
variety of approaches. Recent numerical techniques have
included extensive use of random matrix theory, principal
component analysis or associated eigenvalue decomposition
of the correlation matrix [10-15], specialized clustering
methods [16-22] or time series analysis [23, 24],
pairwise coupling analysis [25], and even topic-modeling
of returns [26]. Indeed, relevant prior work analyzing
historical stock price returns [10, 27, 28] elucidated that
the high-dimensional space of stock price returns has a
low-dimensional representation.

In parallel with this, there is a long tradition of style
analysis in finance in which time series are selected
to serve as useful benchmarks for the performance
of other stocks or indices. The 3-factor model of Fama
and French [28] is one such example. Recently, D. Vistocco
and C. Conversano [29] proposed that Archetypal
Analysis (AA) [30] could provide these benchmark time
series while also providing a way to plot this data in a
meaningful way. In particular, they provide a triangular
plot for Italian mutual funds and suggest parallel
coordinate plots or asymmetric maps for higher dimensional
representations.

FIG. 1. Low-dimensional projection of the stock price
returns data. Stock price returns are projected onto a plane
spanned by two stiff vectors from the SVD of the emergent
simplex corners as described in the supplementary online
information [7]. Each colored circle corresponds to one of the
705 stocks in the dataset used in the analysis. Colors denote
the sectors assigned to companies by Scottrade [8] and the
scheme is shown in (Fig. S6). The grey corners of the simplex
correspond to sector-defining prototype stocks, whereas
all other circles are given by a suitably weighted sum of these
grey corners. Projections along other singular vectors are
shown in (Fig. S2).
FIG. 2. Canonical sector decomposition of stocks of
selected companies. A complete set of all 705 stocks is
provided on the companion website [9]; the color scheme is shown
on the right. Conglomerates like GE decompose roughly into
their core business lines. Tech firms such as Apple that sell
mass-market consumer goods have an important fraction in
c-cyclical, whereas IBM has a significant portion of
c-non-cyclical returns presumably due to its government contracts.
Telecom companies like AT&T are generally classified under
a separate telecom category by major classification systems,
yet analysis shows their returns are described by a combination
of c-non-cyclical and c-utility sectors. Health insurance
providers like Aetna are commonly classified as financial
services firms, but their returns consist of a major part
c-non-cyclical and only a minor part of c-financial; the healthcare
sector is generally less prone to economic downturns. Defense
contractors like Lockheed are listed as capital goods companies,
but their returns are seen to be majority c-non-cyclical
and only a smaller share of c-industrial sector.


Here, we demonstrate a new, holistic way of classifying
stocks into industrial sectors by utilizing the emergent
structure of price returns in data space. Beyond
the proposal of Vistocco and Conversano, we provide
an interpretation of the archetypes of AA as sectors of
the economy. This structure is purely contained in the
geometry of the time series. Other methods, such as
SVD, can discern that there is some such structure but
are not well suited to a clean description. Archetypal
Analysis, on the other hand, determines the convex hull
of the dataset, making it uniquely suited to creating a
quantitative analysis of the data. In particular, if we
take the log price returns of individual stocks, remove
the overall market return, and normalize to zero mean and
unit s.d., then stock returns are well-approximated by
a hyper-tetrahedral structure. Each lobe of the
hyper-tetrahedron is populated by stocks of similar or related
businesses (Fig. 1); the lobe-corners (canonical sectors)
approximate the returns of companies that are prototypical
of individual sectors (Table I). Returns of each stock
can be decomposed into a weighted sum (Fig. 2) of the
canonical sector returns (Fig. 3). Lastly, the canonical
sector weights for a given company are dynamic and lead
to insights into its evolution (Fig. 5).

FIG. 3. Emergent sector time series. Annualized cumulative
log price returns of the eight emergent sectors are
shown. The time series capture all important features
affecting different sectors: building-up of the dot-com bubble
(c. 2000) followed by a burst, the soaring energy valuations
(2003-08) followed by a crash, and the financial crisis of 2008.
We note that the dot-com bubble was confined to c-tech
whereas the financial crisis effects were spread throughout the
sectors. The precise definition of the cumulative returns plotted
here is given in (Eqn. S2); other measures of sector dynamics
are in (Fig. S4).

The daily log returns of a stock s are defined
as $r_{ts} = \log P_{ts} - \log P_{(t-1)s}$, where $P_{ts}$ are adjusted
closing prices (i.e. corrected for stock splits and dividend
issues) and t is in trading days. In the present analysis,
we used normalized returns, $R'_{ts} = (r_{ts} - \langle r_{ts} \rangle_t)/\sigma_s$,
where $\sigma_s^2 = \langle r_{ts}^2 \rangle_t - \langle r_{ts} \rangle_t^2$ is the variance (squared
volatility). Overall market returns from each stock were also
removed, yielding $R_{ts}$, i.e. $R'_{ts}$ with its market component
subtracted. The hyper-tetrahedron, or simplex, which emerges
(Fig. 1) is a self-organized structure: it has prototypical firms
in corners (Table I), closely related firms clumped together in
each lobe, diversified companies (GE, Walt Disney, 3M,
etc.) close to the center, and the number of lobes denoting
how many distinct sectors are exhibited by the data.
This suggests a natural way to decompose stocks into
canonical sectors: for a simplex, each interior point is
representable as a unique weighted sum of corner points,
implying here that every stock’s return is approximated
by a weighted sum of returns from the canonical sectors.
Conversely, the weights for a given stock quantify its
exposure to the canonical sectors.
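The normalization steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic prices, not the code used for the analysis: the function names `normalized_returns` and `remove_market_mode` are hypothetical, and the market component is assumed here to be the leading singular direction of the normalized returns, removed by projection.

```python
import numpy as np

def normalized_returns(prices):
    """prices: array of shape (T+1, S) of adjusted closing prices."""
    r = np.diff(np.log(prices), axis=0)      # r_ts = log P_ts - log P_(t-1)s
    r_centered = r - r.mean(axis=0)          # subtract the time average of each stock
    sigma = r.std(axis=0)                    # volatility of each stock
    return r_centered / sigma                # R'_ts: zero mean, unit s.d. per stock

def remove_market_mode(R_prime):
    """Assumption: the market mode is the leading singular direction of R'."""
    U, s, Vt = np.linalg.svd(R_prime, full_matrices=False)
    return R_prime - s[0] * np.outer(U[:, 0], Vt[0])

# Synthetic example: one common (market) factor plus idiosyncratic noise.
rng = np.random.default_rng(0)
T, S = 500, 20
market = rng.normal(0.0, 0.01, size=(T, 1))
idio = rng.normal(0.0, 0.01, size=(T, S))
log_prices = np.log(100.0) + np.cumsum(market + idio, axis=0)
prices = np.vstack([np.full((1, S), 100.0), np.exp(log_prices)])

R_prime = normalized_returns(prices)
R = remove_market_mode(R_prime)
```

After this step the largest singular value of `R` is the second-largest singular value of `R_prime`, since only the leading direction has been projected out.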

We applied an in-house Python implementation of the
AA algorithm described by Mørup and Hansen [34]. The
dataset consisted of 705 US firms’ stocks with a minimum
$1 billion June 2013 market capitalization and with
continuous 20 years (1993-2013) of listing on major
exchanges. Analysis of this dataset revealed eight emergent
sectors which were named in accordance with the
companies they comprised (prefix c- denotes “canonical”):
c-cyclical (including retail), c-energy (including
oil and gas), c-industrial (including capital goods and
basic materials), c-financial, c-non-cyclical (including
healthcare and consumer non-cyclical goods), c-real estate,
c-technology, and c-utility. Calculated participation
weights for a sample of 12 firms in (Fig. 2) show a
decomposition of their stocks into the canonical sectors with
resulting insights discussed in the caption. Associated
with each canonical sector f is a time series of returns.
As expected, these series show hallmark historical events
of individual sectors (Fig. 3): the dot-com bubble, the
energy crisis, and the financial crisis being the major events
in the last two decades.

FIG. 4. Changes in the decomposition with dimensionality. A Sankey diagram (generated using D3 [31]) displaying
the relationships between sector decompositions with n = N + 1 and n = N. Relative node sizes correspond roughly to the
amount of the market participating in the sector. Connection width depicts how strongly the sectors for decompositions with
different n relate. For details, see Sector Changes with Dimensionality.


Canonical sector   Business lines                                      Prototypical examples

c-cyclical         general and specialty retail, discretionary goods   Gap, Macy’s, Target
c-energy           oil and gas services, equipment, operations         Halliburton, Schlumberger
c-financial        banks, insurance (except health)                    US Bancorp., Bank of America
c-industrial       capital goods, basic materials, transport           Kennametal, Regal-Beloit
c-non-cyclical     consumer staples, healthcare                        Pepsi, Procter & Gamble
c-real estate      realty investments and operations                   Post Properties, Duke Realty
c-technology       semiconductors, computers, comm. devices            Cisco, Texas Instruments
c-utility          electric and gas suppliers                          Duke Energy, Wisconsin Energy


TABLE I. Canonical sectors and major business lines of primary constituent firms. The eight canonical sectors
identified by the analysis described here are listed in the column on the left; these were named in accord with the business lines
(middle column) of firms that show strong association with these sectors. Some examples are provided in the right column; a
full list is available on the companion website [9].

Determining the correct number of canonical sectors
that appropriately describe the space of stock market
returns is akin to the more general issue of selecting a
signal-to-noise ratio cutoff, or a truncation threshold in
the dimensional reduction of data. The choice of this
threshold is generally sensitive to sampling, yet the
results presented here are reasonably robust, with different
choices leading to meaningful and similar decompositions.
Fig. 4 depicts the changes in the decomposition
with dimension. Details of how the figure was generated
as well as more information on the two- and three-dimensional
decompositions are available in the Supplemental
Material [7].

In addition to the full data set of 20 years x 705 firms,
we also applied the algorithm to overlapping, two-year
Gaussian windows to study how the sector weights for
firms have evolved in time (Fig. 5). As expected, the
sector decomposition of firms is dynamic. Mergers,
acquisitions, spin-offs, new products, effects of competitive
environments or shifting consumer preferences can change
the business foci of firms and hence alter the sector
association of firms. External events affecting companies in
an idiosyncratic manner also show a clear signature in this
analysis.

FIG. 5. Evolving sector participation weights. Results
from the sector decomposition made with rolling two-year
Gaussian windows are shown for selected stocks. A complete
set of 705 charts is provided on the companion website
[9]. Color scheme is as in (Fig. 2). For stable and focused
companies such as Pacific Gas & Electric or IBM, one sees
no significant shifts in sector weights; changes in time agree
with errors expected from unresolved fluctuations [9].
Wal-Mart’s returns, on the other hand, have moved significantly
from c-cyclical to c-non-cyclical (consumer staples) in the
post-financial-crisis years as shown; this is also true of other
low-price consumer commodities retailers such as Costco, but
not true of higher price retailers such as Whole Foods, Macy’s,
etc. Corning, previously an industrial firm with a huge presence
in optical fiber, suffered in the aftermath of the dot-com
crisis and now is classified as a tech firm presumably
due to its Gorilla® glass used in cellphones, laptop displays,
and tablets. Berry Petroleum grew within its home state of
California in the early 1990s through development on properties
that were purchased in the earlier part of the 20th century.
In 2003, the company embarked on a transformation [32] by
direct acquisition of light oil and natural gas production
facilities outside California. The figure shows a clear shift in
the distribution of sector weights as the company has moved
toward c-energy and away from c-real estate. Similarly, as
Plum Creek Timber converted to a real estate investment
trust (REIT) in the late 1990s [33], its sector weights have
significantly shifted toward the c-real estate sector.


The eight-factor decomposition presented here explains
11.1% of the total variation ($r^2$) in the normalized
returns with the market mode removed, and 56% of the
random matrix theory explainable variation defined in
[9]. For comparison, the classic three-factor decomposition
of portfolio returns by Fama and French [28] into
market mode, market capitalization, and growth versus value
yields an $r^2$ value of only 4.75%. Indeed, if only three
factors are used instead of the eight for the decomposition
presented here, the regression yields a comparable
value (5.61%), but there appears to be no correspondence
between the three factors found by our unsupervised model
and those of Fama and French (Fig. S8). Carrying out
a similar comparison with Fama and French’s analysis
applied to model portfolio returns, the regression on the
S&P 500 yields an $r^2$ value of 99.4% for Fama and French
compared to 93.5% for our eight-factor decomposition
(market mode reintroduced). Our decomposition was
optimized without concern for market capitalization, which
appears to be the key difference: for an equal-weighted
index of the 338 stocks in the S&P 500 with current tickers
and a complete data series in our time of interest, we
obtain an $r^2$ value of 99.0% (97.0% for 3 factors)
compared to 95.8% for Fama and French.

Future work remains to address survivorship bias,
effects of sampling at different frequencies, and incorporating
market capitalization. Investors, analysts, and governments
alike would benefit from the development of
new investable sector indices [7] that measure the health
of our industrial sectors just like the macroeconomic
indicators (GDP, housing starts, unemployment rate, etc.)
measure the health of our broader economy. Tracing the
sectors back in time [ArchetypalEvolution] could elucidate
the incorporation of science and technology into our
economic system. Finally, our unsupervised decomposition
could provide data suitable for quantitative modeling
of the internal and external dynamics of our economic
system.
2014). We thank Jean-Philippe Bouchaud, Ming Huang
and Janet Gao for helpful discussions. 
DATASET PARTICULARS

Company names, tickers, listed sectors and market
caps of US-based firms used in this analysis were obtained
from Scottrade [1]. Daily closing prices adjusted
for stock splits and dividend issues were obtained from
Yahoo Finance [2]. The rare cases of missing prices in
the time series were replaced with linearly interpolated
values. A brief summary of listed sectors and the number of
companies in each is provided in (Table S1), and a full list
of company names, tickers, market caps and listed-sector
info is available on the companion website [3].


Listed sector              Companies

Basic materials            58
Capital goods              61
Consumer cyclical          41
Consumer non-cyclical      40
Energy                     42
Financial (+Real estate)   138
Healthcare                 53
Services (+Retail)         101
Technology                 93
Telecom                    6
Utility                    57
Transport                  15

TOTAL                      705


TABLE I. Listed sectors and number of companies in the
dataset analyzed. Tickers for each company were obtained
from [1].


RETURNS FACTORIZATION AND SECTOR 
DECOMPOSITION 

A variety of factorization algorithms have been developed
in recent years for dimensional reduction, classification
or clustering. Examples include archetypal analysis
(AA) [4], heteroscedastic matrix factorization [5], binary
matrix factorization [6], K-means clustering [7], simplex
volume maximization [8], independent component analysis
[9], non-negative matrix factorization (NMF) [10, 11]
and its variants such as the semi- and convex-NMF [12],
convex hull NMF [13] and hierarchical convex NMF [14],
among others. Each method has a unique interpretation
[15] and therefore, a successful application of any of these
methods is contingent upon the underlying structure of
the data.

The hyper-tetrahedral structure of log price returns
seen in our analysis motivates a decomposition so that
each stock’s return is a weighted mixture of canonical
sectors, constrained to lie in the convex hull of the data.
Hence we employ AA factorization, which is defined as:

$$R_{ts} \approx R_{ts'} C_{s'f} W_{fs},$$

$$C_{s'f} \ge 0, \quad \sum_{s'} C_{s'f} = 1, \qquad (1)$$

$$W_{fs} \ge 0, \quad \sum_{f} W_{fs} = 1.$$

Columns of $R_{ts'} C_{s'f} = E_{tf}$ are the emergent sector time
series (basis vectors) representing the n corners of the
hyper-tetrahedron, and $W_{fs}$ are the participation weights
($W_{fs} \ge 0$) in sector f so that $\sum_f W_{fs} = 1$ for each stock
s. The sector matrix $E_{tf}$ is within the convex hull
($C_{s'f} \ge 0$, $\sum_{s'} C_{s'f} = 1$) of the data $R_{ts}$. It can be found by either
minimizing the squared error with convex constraints in
factorization as originally proposed [4], or by making a
convex hull of the dataset and choosing one or more of its
vertices to be basis vectors, or by making a convex hull in
low dimensions and choosing one or more of its vertices
to be basis vectors [16], or by minimizing after initializing
with candidate archetypes that are guaranteed to lie in
the minimal convex set of the data [17]. The columns of
the C matrix are shown in (Fig. S8).
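To make the constrained factorization concrete, the following is a minimal sketch that alternates projected-gradient updates of W and C, each followed by a Euclidean projection of the columns onto the probability simplex. It is a simplified stand-in chosen for brevity, not the PCHA algorithm of [17] and not the in-house implementation; the function names, step-size rule (inverse Lipschitz constant of each block), iteration count and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def project_columns_to_simplex(M):
    """Euclidean projection of every column of M onto the probability simplex."""
    n, m = M.shape
    U = -np.sort(-M, axis=0)                      # each column sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    k = np.arange(1, n + 1)[:, None]
    cond = (U - css / k) > 0
    rho = n - 1 - np.argmax(cond[::-1], axis=0)   # last row index where cond holds
    theta = css[rho, np.arange(m)] / (rho + 1)
    return np.maximum(M - theta, 0.0)

def archetypal_analysis(R, n, iters=300, seed=0):
    """Approximate R (T x S) by (R @ C) @ W with columns of C and W on the simplex."""
    rng = np.random.default_rng(seed)
    S = R.shape[1]
    C = project_columns_to_simplex(rng.random((S, n)))
    W = project_columns_to_simplex(rng.random((n, S)))
    G = R.T @ R
    G_norm = np.linalg.norm(G, 2)
    sse = []
    for _ in range(iters):
        E = R @ C                                 # current sector time series
        step_W = 1.0 / (2.0 * np.linalg.norm(E.T @ E, 2) + 1e-12)
        W = project_columns_to_simplex(W - step_W * 2.0 * E.T @ (E @ W - R))
        step_C = 1.0 / (2.0 * G_norm * np.linalg.norm(W @ W.T, 2) + 1e-12)
        grad_C = 2.0 * (G @ C @ W - G) @ W.T      # gradient of ||R - R C W||_F^2 in C
        C = project_columns_to_simplex(C - step_C * grad_C)
        sse.append(float(np.sum((R - R @ C @ W) ** 2)))
    return C, W, np.array(sse)

# Synthetic data: convex mixtures of three hidden corner series plus noise.
rng = np.random.default_rng(1)
T, S, n = 200, 40, 3
E_true = rng.normal(size=(T, n))
W_true = rng.dirichlet(np.ones(n), size=S).T      # shape (n, S), columns sum to one
R = E_true @ W_true + 0.05 * rng.normal(size=(T, S))
C, W, sse = archetypal_analysis(R, n)
```

With step sizes set to the inverse Lipschitz constant of each block, the squared error cannot increase from one iteration to the next, which gives a simple convergence check; a production implementation would typically use line search and a better initialization.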

CALCULATIONS AND CONVERGENCE 

Numerical computations were performed using an in-house
Python language implementation of the principal
convex hull analysis (PCHA) algorithm as described in
[17]. For the full dataset, the factorization R = EW,
with E = RC as defined in (Eqn. S1), converged in 35
iterations to a predefined tolerance value of $\Delta_{\mathrm{SSE}} < 10^{\ldots}$,
where $\Delta_{\mathrm{SSE}}$ is the average difference in sum of square
error per matrix element in R - EW from one iteration
to the next. The resulting columns of $E_{tf}$ are shown in
(Fig. S5) (top row). Annualized cumulative log returns
are obtained by summing rows of $E_{tf}$:

$$Q_f(t) = \ldots \qquad (2)$$

The time series $Q_f(t)$ are shown in (Fig. 3) and the
middle row of (Fig. S5). Weights $W_{fs}$ for selected stocks
are shown in (Fig. 2); the remainder are available on the
companion website [3]. In each canonical sector f, the
components of weights for companies are shown in (Fig.
S6).

The analysis of evolving sector weights was performed
similarly, but with a sliding Gaussian time window. We
decomposed the local normalized log returns for each
stock into the canonical sectors determined from the entire
time series. Each column (time series) of the returns
matrix $R_{ts}$ was multiplied with a Gaussian,
$G_\mu(t) = \exp\left(-(t - \mu)^2 / (2 \times 250^2)\right)$, of standard deviation 250
centered at $\mu$ to obtain $R^{\mu}_{ts}$. We use $C_{s'f}$ found using
the full dataset (Eqn. S1) (corresponding to keeping
the sector-defining simplex corners fixed). $R^{\mu}_{ts}$ is factorized
to obtain new weights $W^{\mu}_{fs}$ that describe the sector
decomposition of stocks in the period focused at $t = \mu$:
$R^{\mu}_{ts} = R^{\mu}_{ts'} C_{s'f} W^{\mu}_{fs}$. $\mu$ is increased in steps of 50 starting
at $\mu = 0$ and ending at $\mu = 5000$, and $W^{\mu}_{fs}$ is calculated
at each $\mu$ with the corresponding $R^{\mu}$. These results are
plotted in (Fig. 5) for a select group of companies; the
remainder are available on the companion website [3].
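The windowing step can be sketched as follows. This covers only the construction of the Gaussian-weighted returns for each window center; the subsequent fit of the weights with the corners held fixed is not shown. The function names and the synthetic stand-in for the returns matrix are assumptions made for illustration.

```python
import numpy as np

def gaussian_window(T, mu, width=250.0):
    """G_mu(t) = exp(-(t - mu)^2 / (2 * width^2)) evaluated at t = 0, ..., T-1."""
    t = np.arange(T)
    return np.exp(-((t - mu) ** 2) / (2.0 * width ** 2))

def windowed_returns(R, mu, width=250.0):
    """Multiply every column (time series) of R by the Gaussian centered at mu."""
    return R * gaussian_window(R.shape[0], mu, width)[:, None]

# Window centers: steps of 50 trading days from mu = 0 to mu = 5000.
centers = np.arange(0, 5001, 50)

# Synthetic stand-in for the returns matrix (T trading days x S stocks).
rng = np.random.default_rng(0)
R = rng.normal(size=(5000, 5))
R_mu = windowed_returns(R, mu=1000)
```

A standard deviation of 250 trading days corresponds to roughly one year, so each window effectively weights about two years of data around its center.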

To address the challenge of distinguishing signal from
noise in the evolving sector weights, we simulate data to
which we add noise and then compare. This was done
by repeating the analysis for the flows where the companies
from Figure 5 were replaced. For each of these
companies, we took its sector weights, $W_{fs}$, and multiplied
by $E_{tf}$ to obtain a time series for the company
with weights that are constant in time. We then added
Gaussian random noise with standard deviation one and
replaced these companies by this simulated data. Figure
S1 shows the comparison between the real flows from the
main text and the simulated constant data with noise
added. General features described in the text are shown
to be signal while small fluctuations are consistent with
noise.


DIMENSIONALITY OF THE SPACE OF PRICE 
RETURNS 

It is often the case with large datasets that the effective
dimensionality of the data space is much lower when
one filters out the noise. Of the many dimensional reduction
methods, the most commonly used is singular value
decomposition (SVD) [18], a deterministic matrix factorization.
We discuss SVD in more detail in order to draw
a contrast with previous SVD results, and to apply it for
quantifying the explainable variation in the returns data.

An SVD of $R_{ts}$ is a matrix factorization [18]
$R_{ts} = U_{tf} \Sigma_{ff'} V^{T}_{f's}$ such that matrices U and V are orthogonal;
$\Sigma$ is a diagonal matrix of “singular values”. If the goal
were purely rank-reduction, n entries of $\Sigma$ chosen to lie
above a “noise threshold” are retained and the rest truncated
so that $0 < f, f' \le n$. This effectively reduces
the dimension of R to n. The choice of n can be informed
by the distribution of singular values as discussed
later. The rows of $V^{T}$ are precisely the eigenvectors of
the stock-stock returns correlation matrix, which is
proportional to $R^{T}R$.
It was previously reported that some components of the
stiff eigenvectors of this stock-stock correlation matrix
loosely corresponded to firms belonging to the same
conventionally identified business sector [19] (but see Fig.
S7).

After normalizing the log returns, the returns matrix
R has entries of unit variance. If the entries were
uncorrelated random variables drawn from a standard normal
distribution, the singular values of R (which are also the
positive square roots of the eigenvalues of $R^{T}R$) would be
described by Wishart statistics [20]. The Wishart ensemble
for a matrix of size $\alpha \times \beta$ predicts a distribution of
singular values with a characteristic shape [20], bounded
for large matrices by $\sqrt{\alpha} \pm \sqrt{\beta}$. Comparing the stock
correlations with Wishart statistics has been previously
used to filter noise from financial datasets [21]. As shown
in (Fig. S2), most singular values of the returns matrix
R lie in the bulk below the bound set by the Wishart
ensemble, whereas only ~20 fall outside that cutoff.
Historically, this has served as an indication
that singular values within the bulk correspond to noise
[21]. Recently, however, much progress has been made
in the development of techniques to extract signal from
the bulk [22-24]. Our method does not claim to capture
this information. Rather, we measure its ability to capture
variation in the data above the cutoff by means of the
random matrix theory explainable variation as defined in
Coefficient of Determination. The largest singular value
of $R_{ts}$ corresponds to what we will refer to as the “market
mode”, as this represents the overall simultaneous rise and
fall of stocks. In the analysis presented in this paper, this
mode has been filtered from the returns matrix by projecting
the R matrix into the subspace spanned by all
non-market-mode eigenvectors. This is nearly equivalent
to filtering the market mode using simple linear regression
(as done commonly [19]), although more convenient.
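The comparison with the random-matrix bound can be illustrated on synthetic data. The sketch below draws a Gaussian matrix, computes the large-matrix bounds on its singular values, and counts how many singular values exceed the upper edge once two strong factors are planted. The function names, the matrix sizes, and the 5% safety `margin` (used only to absorb finite-size fluctuations near the edge) are assumptions for illustration and are not part of the analysis in the text.

```python
import numpy as np

def wishart_singular_value_bounds(alpha, beta):
    """Large-matrix bounds sqrt(alpha) -/+ sqrt(beta) for an alpha x beta Gaussian matrix."""
    lo = abs(np.sqrt(alpha) - np.sqrt(beta))
    hi = np.sqrt(alpha) + np.sqrt(beta)
    return lo, hi

def count_above_edge(R, margin=1.0):
    """Number of singular values of R above margin * (upper bound), plus the spectrum."""
    s = np.linalg.svd(R, compute_uv=False)
    _, hi = wishart_singular_value_bounds(*R.shape)
    return int(np.sum(s > margin * hi)), s

rng = np.random.default_rng(0)
alpha, beta = 1000, 100
noise = rng.normal(size=(alpha, beta))            # pure-noise "returns"
# Two planted rank-one factors, each far stronger than the noise edge.
signal = sum(np.outer(rng.normal(size=alpha), rng.normal(size=beta)) for _ in range(2))

lo, hi = wishart_singular_value_bounds(alpha, beta)
n_noise, s_noise = count_above_edge(noise, margin=1.05)
n_signal, s_signal = count_above_edge(noise + signal, margin=1.05)
```

In this toy setting the pure-noise spectrum stays inside the bounds, while the planted factors appear as singular values well outside the bulk, which is the same logic used to separate the ~20 stiff modes from the bulk in the returns data.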

LOW-DIMENSIONAL PROJECTIONS OF PRICE 
RETURNS 

The emergent low-dimensional, hyper-tetrahedral
(simplex) structure of stock price returns can be seen
by projecting the dataset into stiff “eigenplanes”.
Eigenplanes are formed by pairs of right singular vectors from an
SVD. Here, we construct an SVD of the simplex corners,
$E_{tf} = X_{tk} Y_{kk'} Z^{T}_{k'f}$; the simplex corners are mapped to columns
of $YZ^{T}$ because $YZ^{T} = X^{T}E$ (in other words, $X^{T}$ is
a projection operator). The plots in (Fig. S3) are the
projections of the dataset, $X^{T}_{kt} R_{ts} = v_{ks}$. The rows of v
taken in pairs form the axes of the projections in (Figs.
1 and S3). With those plots, it becomes clear that the
eigenplanes represent projections of simplex-like data
into two dimensions. Secondly, we note that the simplex
structure becomes less clear as one looks at planes
corresponding to smaller singular value directions; the signal
eventually becomes buried in the noise.

Similarly, the results of the factorization can be seen
in eigenplanes from the SVD of $E_{tf} W_{fs} = L_{tk} M_{kk'} N^{T}_{k's}$.
These results (rows of $MN^{T}$) are shown in (Fig. S4),
where we notice that the data now reside entirely within the
simplex region, as expected from the constraints.

FIG. S1. Comparison between flow diagrams presented in Figure 5 of main text with simulated data. The simulated data is
created from the dot product of the weight vector of the company with the corner time series as described in Calculations and
Convergence. This yields a version of the company with constant weights in time. To this we add Gaussian noise with standard
deviation one and repeat the analysis to generate the flows in time. In the left column are the actual flows for companies, on
the right is their constant-in-time counterpart with added noise. We see that key features noted in the main text are in fact
signal while small fluctuations correspond to noise.
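The projection onto eigenplanes described in this section can be sketched with NumPy as below. The example uses synthetic corners and exactly convex weights, so the projected data are convex combinations of the projected corners; the function name `eigenplane_projection` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def eigenplane_projection(E, R):
    """Project corners E (T x n) and data R (T x S) using the SVD E = X diag(y) Z^T."""
    X, y, Zt = np.linalg.svd(E, full_matrices=False)
    corners = X.T @ E        # equals diag(y) @ Zt, one column per simplex corner
    v = X.T @ R              # v_ks: coordinates of every stock; rows pair up into planes
    return X, y, Zt, corners, v

rng = np.random.default_rng(0)
T, n, S = 300, 4, 50
E = rng.normal(size=(T, n))                       # synthetic sector time series
W = rng.dirichlet(np.ones(n), size=S).T           # convex weights, shape (n, S)
R = E @ W                                         # noiseless data inside the simplex
X, y, Zt, corners, v = eigenplane_projection(E, R)

# Coordinates for a scatter plot in the plane of the two stiffest directions.
plane_xy = v[:2]
corner_xy = corners[:2]
```

Because the projection is linear, every projected point is the same weighted sum of the projected corners, so in this noiseless setting all points fall inside the polygon spanned by `corner_xy`.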

COEFFICIENT OF DETERMINATION ($r^2$)

We measured the goodness of the returns decomposition
R = EW by measuring the coefficient of determination
($r^2$) as follows:

$$r^2 = 1 - \mathrm{SSE}/\mathrm{SST} \qquad (3)$$

Here, SSE denotes the sum of square errors
$\|R - EW\|_F^2$, and SST is the total sum of squares
$\|R\|_F^2$. This is also known as the proportion of variance
explained (PVE). For the factorization of the full
dataset, normalized with the market mode removed,
the calculated $r^2$ value is 11.1%. The SVD of R with
singular values shown in (Fig. S2) provides a convenient
way to put this number in context for the returns
dataset. Only 20 singular values (excluding the market
mode) were above the cut-off that was predicted by
random matrix theory for a matrix of purely random
Gaussian entries. For any matrix M with elements $M_{ij}$,
the norm $\|M\|_F^2 = \sum_{ij} M_{ij}^2 = \sum_i s_i^2$, where $s_i$ are
the singular values [18]. Thus, the fraction of intrinsic
variation in R above the cutoff is the sum of squares
of the 20 singular values (not including market mode)
divided by SST, $\sum_{i=1}^{20} s_i^2 / \|R\|_F^2 = 19.8\%$. Therefore,
as a first approximation, the factorization explains
11.1/19.8 = 56% of the random matrix theory (RMT)
explainable variation.
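The quantities in this section can be computed as in the sketch below, which also checks the Frobenius-norm identity and the quoted ratio 11.1/19.8. The function names and the synthetic matrix are assumptions for illustration; the rank-k truncated SVD is used only as a convenient factorization with a known $r^2$.

```python
import numpy as np

def r_squared(R, E, W):
    """r^2 = 1 - SSE/SST for the factorization R ~ E @ W."""
    sse = np.sum((R - E @ W) ** 2)     # ||R - EW||_F^2
    sst = np.sum(R ** 2)               # ||R||_F^2
    return 1.0 - sse / sst

def explainable_fraction(R, n_signal):
    """Fraction of ||R||_F^2 carried by the n_signal largest singular values."""
    s = np.linalg.svd(R, compute_uv=False)
    return np.sum(s[:n_signal] ** 2) / np.sum(s ** 2)

rng = np.random.default_rng(0)
R = rng.normal(size=(400, 60))
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# A rank-k truncated SVD explains exactly the top-k share of the squared singular values.
k = 8
r2_svd = r_squared(R, U[:, :k] * s[:k], Vt[:k])
frac_k = explainable_fraction(R, k)

# Ratio quoted in the text: r^2 of the factorization over the RMT explainable variation.
rmt_share = 11.1 / 19.8
```

Dividing the $r^2$ of any factorization by `explainable_fraction(R, 20)` gives the share of the RMT explainable variation it captures, which is how the 56% figure arises from 11.1% and 19.8%.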

For reference we provide the RMT explainable variation
for the factor decomposition of Fama and French, the
classification by Scottrade, and the top 8 singular vectors
given by SVD. The percentage of the RMT explainable
variation for different numbers of factors compared to the
3 factor decomposition of Fama and French is shown in
(Table S2). Fama and French have the benefit of allowing
factors to have positive or negative weights. In order
to compare with another non-negative decomposition, we
fix the weight matrix according to the Scottrade labels
and run archetypal analysis for this n = 14 factor version.
The $r^2$ value for this decomposition is 10.7% with a
corresponding RMT explainable variance of 54.2%
compared to 56% for our 8 factors. For completeness, we also
note that if R is rank-reduced to the eight stiffest components
found by SVD (not including market mode), then
the factorization explains 85% of the RMT explainable
variation in R with overall results in good accord
with the analysis presented here. This implies that sector
decomposition information was already contained in
the stiff modes from the SVD of R; however, SVD is not
the appropriate tool for the decomposition.


Bulk Variation          80.2%
Explainable Variation   19.8%

Factors             Percent of Explainable Variation

Market Mode (MM)    8.0%
2 factors + MM      26.0%
3 factors + MM      36.1%
4 factors + MM      42.8%
5 factors + MM      48.9%
6 factors + MM      55.3%
7 factors + MM      59.4%
8 factors + MM      63.7%
9 factors + MM      68.1%
Fama and French     24.0%


TABLE II. Percentage of the Explainable Variance captured
by our model compared with the Fama and French factor
model. Regression is done on the normalized dataset of 705
stocks without the market mode removed. To capture this,
we add the market mode to factors obtained by our
decomposition.


Ticker   Company Name                           Label

EQT      EQT Corporation                        Energy
RDN      Radian Group Inc.                      Financials
STT      State Street Corporation               Financials
LH       Laboratory Corp. of America Holdings   Healthcare
UHS      Universal Health Services Inc.         Healthcare
STZ      Constellation Brands Inc.              Non-Cyclicals
CNL      Cleco Corporation                      Utilities
OKE      ONEOK Inc.                             Utilities
CAKE     The Cheesecake Factory Incorporated    Cyclicals
EFX      Equifax Inc.                           Industrials
ESRX     Express Scripts Holding Company        Non-Cyclicals


TABLE III. Companies which form a new sector when the
dimensionality of the decomposition is increased from n = 8
to n = 9. The labels given are those indicated by Scottrade.

THE NUMBER n OF CANONICAL SECTORS 

It is an open problem to determine the effective
dimensionality (optimal rank) of a general dataset (matrix).
One could select among models of different dimensions
using statistical tests such as the $r^2$ discussed above, or
information theory based criteria such as the Akaike
Information Criterion (AIC) or the Bayesian Information
Criterion (BIC), but the choice of the selection criterion is
itself generally made on an ad hoc basis. Therefore, a
direct observation of the comprehensibility of results is
often the most reliable criterion. In the dataset used
for the analysis described here, a factorization with n > 8
yielded results where both the emergent time series $E_{tf}$
and weights in $W_{fs}$ showed qualitative signs of overfitting.
For example, with n = 9 the results were in good
agreement with n = 8 except for an additional resulting
sector involving participation from only 11 seemingly
unrelated stocks (Table S3 and Figure 4). The high-level
results of factorization with different values of n may be
explored in a number of ways, several of which are
described below.


Sector Changes with Dimensionality 

One approach to investigating how the sector decomposition
changes with dimension is to produce a flow
diagram. To do this, we performed the fit
$\| E_{tf} - E_{tf'} S_{f'f} \|_F^2$ with the constraint
$\sum_{f'} S_{f'f} = 1$, where $E_{tf}$ are the sector time series
for n = N + 1 and $E_{tf'}$ those for n = N. Hence
the sectors for n = 9 can be expressed as a linear combination
of sectors for n = 8, n = 8 as a linear combination
of n = 7, and so forth. The results of these fits
are presented in Figure 5. The figure represents these
relationships through connections between the decompositions
for n = N + 1 and n = N weighted according to
the matrix $S^{(N,N+1)}$. More precisely, we create a node
corresponding to each of the 9 sectors whose size is proportional
to $\sum_s W_{fs}$, where $W_{fs}$ is the weight matrix
for the 9 sector decomposition. Hence, the relative node
sizes represent the amount of the market participating in
the sector. Multiplying this vector by $S^{(8,9)}$ gives the
approximate size for each node in n = 8. Multiplying
this vector by $S^{(7,8)}$ gives the approximate size for each
node in n = 7, and so on. In this way, we generate a
Sankey diagram whose node sizes correspond roughly to 
the amount of the market in the sector and whose connec¬ 
tions depict how strongly the sectors for decompositions 
with different n overlap. In the image, we see that the 
n = 9 decomposition gives the 8 sector version with an 
additional small sector whose companies were listed in 
Table S3. We also see that for n = 7, c-finance and c-real estate
merge. At n = 6, c-industrial and c-cyclical
merge. For n = 5, the new sector containing c-industrial
and c-cyclical merges with c-non-cyclical. For n = 4, c-utility
and c-energy merge. Finally, for n = 3 and n = 2,
no clear pattern emerges given this image alone.
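
The fit between successive decompositions and the propagation of node sizes can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function names, the toy data, and the choice to solve the equality-constrained least-squares problem through its KKT system are all assumptions, and no sign constraint is placed on the fitted matrix because none is stated in the text.

```python
import numpy as np

def fit_flow_matrix(E_coarse, E_fine):
    """Minimize ||E_fine - E_coarse @ S||_F^2 with each column of S summing to 1.

    E_coarse: (T, N) sector time series for the n = N decomposition.
    E_fine:   (T, N + 1) sector time series for the n = N + 1 decomposition.
    Returns S with shape (N, N + 1).
    """
    n = E_coarse.shape[1]
    gram = E_coarse.T @ E_coarse
    ones = np.ones((n, 1))
    # KKT system of the equality-constrained least-squares problem,
    # solved for all columns of S at once (last row holds the multipliers).
    kkt = np.block([[2.0 * gram, ones], [ones.T, np.zeros((1, 1))]])
    rhs = np.vstack([2.0 * E_coarse.T @ E_fine,
                     np.ones((1, E_fine.shape[1]))])
    return np.linalg.solve(kkt, rhs)[:n]

def propagate_node_sizes(S, sizes_fine):
    """Map node sizes of the finer decomposition onto the coarser one."""
    return S @ sizes_fine

# Toy data only (hypothetical): an n = 9 decomposition built from an n = 8 one.
rng = np.random.default_rng(0)
E8 = rng.standard_normal((250, 8))
S_true = rng.random((8, 9))
S_true /= S_true.sum(axis=0)          # columns sum to 1
E9 = E8 @ S_true

S = fit_flow_matrix(E8, E9)
W9 = rng.random((9, 705))             # stand-in for the weight matrix W_fs
sizes9 = W9.sum(axis=1)               # node sizes: sum over stocks s
sizes8 = propagate_node_sizes(S, sizes9)
```

Because each column of the fitted matrix sums to one, the total node size is preserved when moving from n = N + 1 to n = N, which is what allows the nodes of a Sankey diagram to be drawn on a common scale.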

Two and Three Sector Decompositions 

We further explore the two and three sector decom¬ 
positions by examining their constituent companies and 
looking at pie charts describing the relationship between 
our 8 sector decomposition and those with n = 2 and 
n = 3 respectively. Recall that each archetype is constrained
to be a convex combination of companies, or in
other words to lie in the convex hull of the data. Using
this information, we list the 20 companies which con¬ 
tribute the most to each sector in the two and three fac¬ 
tor decompositions (Tables S4, S5 and S6). For the two 
sector decomposition, we find the sectors divide roughly 
into c-assets (e.g. financial and real estate companies) 
and c-goods (e.g. companies which provide goods and 
services). For n = 3, the division is less clear. Another 
way to look at the constituents of these sectors is by
examining pie chart representations of these decompositions.
Again consider the fit $\| E_{tf} - E_{tf'} S_{f'f} \|_F^2$ with
the constraint $\sum_{f'} S_{f'f} = 1$. Applying this, we can express
the two sector archetypes as linear combinations of
the 8 sector archetypes and vice versa. Additionally, we
can do the same for the three factor decomposition. The
pie charts these fits produce are shown in Figure S10.
The results are consistent with the sector breakdowns
obtained from examining the constituent companies.
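
Ranking the companies that contribute most to a sector amounts to sorting one column of the matrix $C_{sf}$. A minimal sketch follows; the helper name `top_contributors`, the tickers and the toy matrix are hypothetical, and the columns of the matrix are assumed to sum to one so that entries can be read as percentages of the sector.

```python
import numpy as np

def top_contributors(C, tickers, sector, k=20):
    """Return the k largest (ticker, percent) pairs for one column of C."""
    column = np.asarray(C)[:, sector]
    order = np.argsort(column)[::-1][:k]      # indices of the largest entries first
    return [(tickers[i], 100.0 * column[i]) for i in order]

# Toy example (hypothetical tickers): 4 stocks, 2 sectors, columns sum to 1.
tickers = ["AAA", "BBB", "CCC", "DDD"]
C = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.3],
              [0.0, 0.4]])
top_two = top_contributors(C, tickers, sector=0, k=2)
```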

Robustness 

In general, a factorization analysis of the returns
dataset would be sensitive to the number of stocks in the
dataset, the criteria applied for picking stocks, the period over
which historical prices are obtained, and the frequency at
which returns are computed. A robust macroeconomic
analysis would therefore require a large number of stocks
chosen without sampling bias, with returns calculated
over the period of interest and sensitivity checked for the
frequency of returns calculation. On the other hand, an
equity fund manager faces a less daunting task for an
analysis that is limited to the universe of her portfolio of
stocks: either to find its canonical sectors, or to analyze
the exposure of her holdings to the core sectors of the
economy.

CANONICAL SECTOR INDICES 

The matrix $C_{sf}$ in the decomposition $R = RCW$ represents
how the returns $R$ of stocks $s$ must be combined to
make canonical sector returns $E_{tf} = R_{ts} C_{sf}$. Since a
canonical sector is defined as a combination of stocks, an
investment in the sector $f$ can be made via buying a basket
of constituent stocks $s$ in proportions given by $C_{sf}$ or
through an index $I_{tf}$,

$I_{tf} = p_{ts'} C_{s'f}$ (4)

where $p$ are stock prices suitably weighted by market
cap or another divisor, as is common practice for common
indices [25]. An unweighted index of this kind is shown
in the bottom row of (Fig. S5) for results corresponding
to the analysis described in this paper. Conversely, a
pre-defined basket of stocks such as the S&P 500 can be 
unbundled to find its exposure to the canonical sectors. 
With an investment strategy employing longs and shorts 
at the same time in correct proportions, it is conceivable 
to invest in, for example, the c-tech component of S&P 
500. 
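
The relations in this section reduce to a few matrix products, sketched below on toy data. Everything here is illustrative: the array sizes, the random matrices and the equal-weight basket are hypothetical stand-ins, and the normalization in which each stock's weights $W_{fs}$ sum to one over sectors is an assumption. Since $R \approx E W$, the returns of a basket with stock weights $b_s$ are approximately $E_{tf} (W_{fs} b_s)$, so $W_{fs} b_s$ can be read as the basket's exposure to each canonical sector.

```python
import numpy as np

rng = np.random.default_rng(1)
T, S, F = 120, 30, 4                      # toy sizes: times, stocks, sectors

C = rng.random((S, F))
C /= C.sum(axis=0)                        # C_sf: each sector is a convex mix of stocks
W = rng.random((F, S))
W /= W.sum(axis=0)                        # W_fs: each stock's sector weights sum to 1 (assumed)
R = rng.standard_normal((T, S))           # toy returns R_ts
P = 100.0 + np.cumsum(R, axis=0)          # toy price series p_ts (hypothetical)

E = R @ C                                 # canonical sector returns E_tf = R_ts C_sf
index = P @ C                             # unweighted sector index, as in Eq. (4)

b = np.full(S, 1.0 / S)                   # equal-weight basket, stand-in for a broad index
exposure = W @ b                          # share of the basket carried by each sector
basket_model_returns = E @ exposure       # model returns of the basket, (E W) b
```

With these normalizations the exposures sum to one, so they can be read directly as the fraction of the basket attributable to each canonical sector.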

The desirable features of an index include complete¬ 
ness, objectivity and investability [26]. The c-indices 
constructed using the ideas outlined here would not only 
be of value to investors through investment vehicles such 
as ETFs, Futures, etc., but also serve as important eco¬ 
nomic indicators. 
| c-assets | label | percent | full name | c-goods | label | percent | full name |
|---|---|---|---|---|---|---|---|
| DDR | real estate | 1.77% | DDR Corp. | HON | tech | 0.53% | Honeywell International Inc. |
| ONB | financial | 1.7% | Old National Bancorp. | TMO | health | 0.51% | Thermo Fisher Scientific Inc. |
| BRE | real estate | 1.66% | Brookfield Real Estate Serv. | NAV | cyclical | 0.49% | Navistar International Corp. |
| PEI | real estate | 1.54% | Pennsylvania RIT | CSL | basic | 0.47% | Carlisle Companies Inc. |
| FMBI | financial | 1.5% | First Midwest Bancorp. Inc. | IRF | tech | 0.47% | International Rectifier Corp. |
| PRK | financial | 1.5% | Park National Corp. | APD | basic | 0.46% | Air Products & Chemicals Inc. |
| BAC | financial | 1.42% | Bank of America Corp. | PCP | basic | 0.43% | Precision Castparts Corp. |
| STI | financial | 1.41% | SunTrust Banks Inc. | OMC | misc services | 0.43% | Omnicom Group Inc. |
| DRE | real estate | 1.29% | Duke Realty Corp. | MXIM | tech | 0.43% | Maxim Integrated Products, Inc. |
| UBSI | financial | 1.28% | United Bankshares Inc. | TFX | health | 0.41% | Teleflex Inc. |
| CPT | real estate | 1.28% | Camden Property Trust | NSC | transport | 0.41% | Norfolk Southern Corp. |
| PPS | real estate | 1.28% | Post Properties Inc. | NBL | energy | 0.4% | Noble Energy Inc. |
| WABC | financial | 1.26% | Westamerica Bancorp. | SM | energy | 0.4% | SM Energy Company |
| FMER | financial | 1.26% | FirstMerit Corp. | WMT | retail | 0.39% | Wal-Mart Stores Inc. |
| CNA | financial | 1.26% | CNA Financial Corp. | CR | basic | 0.38% | Crane Co. |
| VLY | financial | 1.25% | Valley National Bancorp. | ADI | tech | 0.38% | Analog Devices Inc. |
| MTB | financial | 1.24% | M&T Bankcorp. | ITW | cyclical | 0.38% | Illinois Tool Works Inc. |
| WRI | real estate | 1.23% | Weingarten Realty Investors | PPG | basic | 0.38% | PPG Industries Inc. |
| BDN | real estate | 1.21% | Brandywine Realty Trust | BA | capital | 0.38% | The Boeing Company |
| ZION | financial | 1.2% | Zions Bancorp. | AME | tech | 0.38% | Ametek Inc. |
| Total | | 27.54% | | Total | | 8.53% | |



TABLE IV. Top 20 contributing companies to each sector in the two sector decomposition. Ranking is determined by the
matrix $C_{sf}$, which describes each sector as a linear combination of stocks. Labels are those given by Scottrade and percentage
describes the percentage of the sector attributable to the company.
| sector 1 | label | percent | sector 2 | label | percent | sector 3 | label | percent |
|---|---|---|---|---|---|---|---|---|
| XOM | energy | 1.29% | BRE | real estate | 2.16% | IRF | tech | 1.29% |
| HP | energy | 1.22% | PEI | real estate | 2.08% | EMC | tech | 1.22% |
| CVX | energy | 1.21% | BWS | retail | 1.99% | ADI | tech | 1.21% |
| ETR | utility | 1.2% | CNA | financial | 1.79% | CSCO | tech | 1.2% |
| APD | basic | 1.2% | ONB | financial | 1.73% | TXN | tech | 1.2% |
| OXY | energy | 1.19% | DDR | real estate | 1.63% | BMC | tech | 1.19% |
| NFG | utility | 1.18% | PRK | financial | 1.59% | SNPS | tech | 1.18% |
| PX | basic | 1.17% | CBSH | financial | 1.59% | PLXS | tech | 1.17% |
| CL | non-cyclical | 1.16% | BC | cyclical | 1.56% | CPWR | tech | 1.16% |
| NBL | energy | 1.15% | FMER | financial | 1.55% | AVT | tech | 1.15% |
| OII | energy | 1.11% | RDN | financial | 1.54% | SWKS | tech | 1.11% |
| LNT | utility | 1.11% | MAS | capital | 1.54% | HPQ | tech | 1.11% |
| D | utility | 1.08% | DDS | retail | 1.47% | PMCS | tech | 1.08% |
| DTE | utility | 1.07% | FMBI | financial | 1.47% | MXIM | tech | 1.07% |
| SCG | utility | 1.06% | ALK | transport | 1.46% | ARW | tech | 1.06% |
| WEC | utility | 1.04% | WABC | financial | 1.43% | TER | tech | 1.04% |
| APA | energy | 0.99% | PCH | real estate | 1.42% | ATML | tech | 0.99% |
| BAX | health | 0.98% | VLY | financial | 1.41% | MCHP | tech | 0.98% |
| MUR | energy | 0.98% | BAC | financial | 1.41% | LRCX | tech | 0.98% |
| CPB | non-cyclical | 0.98% | STI | financial | 1.37% | CGNX | tech | 0.98% |
| Total | | 22.38% | Total | | 19.14% | Total | | 32.18% |


TABLE V. Top 20 contributing companies to each sector in the three sector decomposition. Ranking is determined by the
matrix $C_{sf}$, which describes each sector as a linear combination of stocks. Labels are those given by Scottrade and percentage
describes the percentage of the sector attributable to the company.


| sector 1 | full name | sector 2 | full name | sector 3 | full name |
|---|---|---|---|---|---|
| XOM | Exxon Mobil Corp. | BRE | Brookfield Real Estate Serv. | IRF | International Rectifier Corp. |
| HP | Helmerich & Payne Inc. | PEI | Pennsylvania RIT | EMC | EMC Corp. |
| CVX | Chevron Corp. | BWS | Brown Shoe Co. Inc. | ADI | Analog Devices Inc. |
| ETR | Entergy Corp. | CNA | CNA Financial Corp. | CSCO | Cisco Systems Inc. |
| APD | Air Products & Chemicals Inc. | ONB | Old National Bancorp. | TXN | Texas Instruments Inc. |
| OXY | Occidental Petroleum | DDR | DDR Corp. | BMC | BMC Software Inc. |
| NFG | National Fuel Gas Company | PRK | Park National Corp. | SNPS | Synopsys Inc. |
| PX | Praxair Inc. | CBSH | Commerce Bancshares Inc. | PLXS | Plexus Corp. |
| CL | Colgate-Palmolive Co. | BC | Brunswick Corp. | CPWR | Compuware Corp. |
| NBL | Noble Energy Inc. | FMER | FirstMerit Corp. | AVT | Avnet Inc. |
| OII | Oceaneering International Inc. | RDN | Radian Group Inc. | SWKS | Skyworks Solutions Inc. |
| LNT | Alliant Energy Corp. | MAS | Masco Corp. | HPQ | Hewlett-Packard Company |
| D | Dominion Resources Inc. | DDS | Dillard’s Inc. | PMCS | PMC-Sierra Inc. |
| DTE | DTE Energy Corp. | FMBI | First Midwest Bancorp. Inc. | MXIM | Maxim Integrated Products Inc. |
| SCG | SCANA Corp. | ALK | Alaska Air Group Inc. | ARW | Arrow Electronics Inc. |
| WEC | Wisconsin Energy Corp. | WABC | Westamerica Bancorp. | TER | Teradyne Inc. |
| APA | Apache Corp. | PCH | Potlatch Corp. | ATML | Atmel Corp. |
| BAX | Baxter International Inc. | VLY | Valley National Bancorp. | MCHP | Microchip Technology Inc. |
| MUR | Murphy Oil Corp. | BAC | Bank of America Corp. | LRCX | Lam Research Corp. |
| CPB | Campbell Soup Company | STI | SunTrust Banks Inc. | CGNX | Cognex Corp. |


TABLE VI. Top 20 contributing companies to each sector in the three sector decomposition. Ranking is determined by the
matrix $C_{sf}$, which describes each sector as a linear combination of stocks.
FIG. S2. Normalized distribution of singular values. Filled blue histogram corresponds to the distribution of singular values
of returns from the dataset $R_{ts}$; one notices a clear separation between the hump-shaped bulk of singular values and about 20 stiff
singular values (the largest singular value, ~952, corresponding to the market mode, is not shown). Pink line histogram outline
shows the distribution of singular values of a matrix of the same shape as R but containing purely random Gaussian entries.
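
The comparison in this figure can be imitated on synthetic data. In the sketch below, which is only an illustration, the helper name `count_stiff_modes`, the planted three-factor toy dataset, and the use of the largest singular value over a few random Gaussian draws as an empirical edge of the bulk are all assumptions; the returns matrix is assumed to be normalized so that each column has unit variance.

```python
import numpy as np

def count_stiff_modes(R, n_random=5, seed=0):
    """Count singular values of R above the bulk of same-shaped Gaussian matrices."""
    rng = np.random.default_rng(seed)
    sing = np.linalg.svd(R, compute_uv=False)
    # Empirical noise edge: largest singular value seen over a few random draws.
    edge = max(
        np.linalg.svd(rng.standard_normal(R.shape), compute_uv=False)[0]
        for _ in range(n_random)
    )
    return int(np.sum(sing > edge)), edge

# Toy returns (hypothetical): three planted factors plus Gaussian noise.
rng = np.random.default_rng(3)
T, S, K = 500, 100, 3
factors = rng.standard_normal((T, K))
loadings = 2.0 * rng.standard_normal((K, S))
R = factors @ loadings + rng.standard_normal((T, S))
R = (R - R.mean(axis=0)) / R.std(axis=0)      # unit-variance columns

n_stiff, edge = count_stiff_modes(R)
```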
FIG. S3. Low-dimensional projections of stock returns data. Each colored circle represents a stock in our dataset
and is colored according to the listed sectors scheme in (Fig. S5), with sectors assigned by Scottrade [1]. The first row
is repeated from (Fig. 1). Black circles represent the archetypes found with our analysis. The (i, j) figure in the grid is a
plane spanned by singular vectors i and j + 1 (rows of R) from the calculations described earlier. Projections after the
factorization are shown in (Fig. S4).
FIG. S4. Cross-sections along eigenplanes of the factorized returns. Each colored circle represents a stock in our
dataset and is colored according to the scheme in (Fig. 2) based on the primary sector association found after the calculations described
in this paper. Black circles represent the archetypes found with our analysis. The (i, j) figure in the grid is a plane spanned
by singular vectors i and j + 1 (rows of MN'^) from the calculations described earlier. Projections of raw data (before the
factorization) are shown in (Fig. S3).
FIG. S5. Canonical sector time series. Top row: normalized log returns (columns of $E_{tf}$), middle row: cumulative log
returns (same as (Fig. 3) and defined in (Eqn. S2)), and bottom row: unweighted price index of canonical sectors (Eqn. S4).
FIG. S6. Weight distribution in canonical sectors. Each of the eight subplots shows the constituent participation weights
of all 705 companies in a canonical sector (rows of $W_{fs}$). Stocks are colored by listed sectors as shown at the bottom. Listed
sector information was obtained from [1]. Y-axis range is from 0 to 1.
FIG. S7. Singular vectors $V^T$ of the SVD of returns $R_{ts}$. The orthonormal right singular vectors (rows of $V^T$) of the SVD
of $R_{ts}$ are equivalent to the eigenvectors of the stock-stock correlation matrix ($\sim R^T R$). Eight of these stiffest eigenvectors
including the market mode are shown in rows of two at a time. Each has 705 components corresponding to stocks in the
dataset. The market mode with all components in the same direction describes overall fluctuations in the market; it was
excluded from the analysis described in the paper. Previous work [19] has suggested that each eigenvector of the stock-stock
correlation matrix describes a listed sector; however, as seen above, a more accurate interpretation is that each eigenvector is a
mixture of listed sectors with opposite signs in components. For example, the stiffest direction (after market mode) has positive
components in real estate and utility, but negative in tech. Less stiff eigenvectors (including the last one shown here) do not
contain sector-relevant information. Stocks are colored by listed sectors as shown at the bottom. Listed sector information was
obtained from [1]. Y-axis range is from -0.5 to 0.3.
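
The equivalence between right singular vectors of the returns and eigenvectors of $R^T R$ can be checked numerically. The sketch below uses a small random matrix as a hypothetical stand-in for the returns; the sizes and the seed are arbitrary, and each column is normalized so that $R^T R$ is proportional to the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.standard_normal((300, 20))            # toy returns: 300 times, 20 stocks
R = (R - R.mean(axis=0)) / R.std(axis=0)      # normalize each stock's returns

U, sing, Vt = np.linalg.svd(R, full_matrices=False)
evals, evecs = np.linalg.eigh(R.T @ R)        # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder to match descending singular values

# Overlap between the k-th right singular vector and the k-th eigenvector;
# it equals 1 up to an arbitrary overall sign of each vector.
overlap = np.abs(np.sum(Vt * evecs.T, axis=1))
```

The eigenvalues equal the squared singular values, so the ordering from stiff to soft directions is the same in both descriptions.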
FIG. S8. Canonical Sector Constituents (shown as columns of $C_{sf}$). $C_{sf}$ represents a weighted combination of
stocks that defines the canonical sector, each of which has a time series represented by $E_{tf}$ that is given by $E_{tf} = R_{ts} C_{sf}$. The
eight subplots show the constituent participation component of stocks in each canonical sector $f$. Canonical sectors are labeled
on the plot; their names were chosen according to the listed sectors of firms that comprise them. Noteworthy features seen
above include the co-association of listed sectors: basic, capital, transport and part of cyclicals into industrial goods. Similarly,
healthcare and non-cyclicals are coupled together in what we call non-cyclicals. Canonical retail goes primarily with listed
retail and cyclicals. Stocks are colored by listed sectors as shown at the bottom. Listed sector information was obtained from
[1]. Y-axis range is from 0 to 0.05.
FIG. S9. 3 Factor Model vs. Fama and French. 2D projections of the weights for each company in the S&P 500 with current
tickers and data in the date range we consider. Red denotes companies with large market caps (market cap > 10 billion), blue
denotes medium (market cap 2-10 billion) and green denotes small (market cap < 2 billion). For our decomposition (a), there
is no separation distinguishable by size of company. In comparison, for the Fama and French decomposition (b), there appears
a gradation from large to small companies consistent with a factor of the model being related to size. (This is natural, since one
of Fama and French’s factors explicitly is the difference between large and small-cap returns). Thus our unsupervised 3-factor
decomposition appears quite distinct from Fama and French’s hand-created one.
FIG. S10. Pie charts depicting sectors as linear combinations of other sector decompositions having a different
value of the dimensionality n. (a) Two sector decomposition with respect to the eight sector version, (b) three with respect
to eight, (c) eight with respect to two, (d) eight with respect to three. For (a) and (b) the color scheme is the same as used
throughout for the eight sector decomposition. For (c) and (d) colors correspond to those in Figure 5 for the two and three
sector nodes. Through these charts it is evident that the two sector decomposition corresponds to a c-assets sector containing
c-finance and c-real estate, and a c-goods sector containing companies which provide goods and services. In (c) and (d) we
see c-industrial, c-cyclical and c-non-cyclical, which merge by n = 5, split between the two and three factor decompositions
respectively, consistent with Figure 5.



